Artificial intelligence in fracture detection with different image modalities and data types: A systematic review and meta-analysis

Artificial Intelligence (AI), encompassing Machine Learning and Deep Learning, has increasingly been applied to fracture detection using diverse imaging modalities and data types. This systematic review and meta-analysis aimed to assess the efficacy of AI in detecting fractures through various imaging modalities and data types (image, tabular, or both) and to synthesize the existing evidence related to AI-based fracture detection. Peer-reviewed studies developing and validating AI for fracture detection were identified through searches in multiple electronic databases without time limitations. A hierarchical meta-analysis model was used to calculate pooled sensitivity and specificity. A diagnostic accuracy quality assessment was performed to evaluate bias and applicability. Of the 66 eligible studies, 54 identified fractures using imaging-related data, nine using tabular data, and three using both. Vertebral fractures were the most common outcome (n = 20), followed by hip fractures (n = 18). Hip fractures exhibited the highest pooled sensitivity (92%; 95% CI: 87–96, p < 0.01) and specificity (90%; 95% CI: 85–93, p < 0.01). Pooled sensitivity and specificity using image data (92%; 95% CI: 90–94, p < 0.01; and 91%; 95% CI: 88–93, p < 0.01) were higher than those using tabular data (81%; 95% CI: 77–85, p < 0.01; and 83%; 95% CI: 76–88, p < 0.01), respectively. Radiographs demonstrated the highest pooled sensitivity (94%; 95% CI: 90–96, p < 0.01) and specificity (92%; 95% CI: 89–94, p < 0.01). Patient selection and reference standards were major concerns in assessing diagnostic accuracy for bias and applicability. AI displays high diagnostic accuracy for various fracture outcomes, indicating potential utility in healthcare systems for fracture diagnosis. However, enhanced transparency in reporting and adherence to standardized guidelines are necessary to improve the clinical applicability of AI. Review Registration: PROSPERO (CRD42021240359).


Introduction
Bone fractures represent a significant public health concern globally [1], particularly for individuals with osteoporosis [2]. Fractures contribute to work absences, disability, reduced quality of life, health complications, and increased healthcare costs, affecting individuals, families, and societies [3,4]. A meta-analysis of 113 studies reported the pooled cost of hospital treatment for a hip fracture after 12 months as $10,075, with total health and social care costs amounting to $43,669 per hip fracture [5].
However, existing systematic reviews and meta-analyses focused solely on image-based analyses, neglecting a comprehensive examination of various imaging modalities and data types (image, tabular, or both). Despite the strong performance of AI for medical image analysis and for tabular data, a critical gap exists in the current literature concerning the optimal choice of image modalities and the choice between image, tabular, or combined data types. There is a lack of comprehensive guidance on the most effective selection of image modalities and data types for fracture diagnosis. This gap in knowledge underscores the need for systematic investigation to determine which image modality, and by extension which data type, yields the highest diagnostic accuracy and clinical relevance in AI algorithms. Addressing this gap will not only optimize the design of AI-based diagnostic tools but also enable healthcare practitioners to make informed decisions when selecting appropriate imaging modalities and data types for improved patient care.
Thus, this study primarily aims to evaluate the diagnostic accuracy of AI in fracture detection using diverse imaging modalities and data types, reflecting AI's growing role in healthcare. Additionally, we seek to synthesize current evidence on AI-based fracture detection, offering a concise overview and discerning the strengths and limitations of various data types, whether image, tabular, or combined.

Identification and selection of studies
This systematic review, registered with PROSPERO (CRD42021240359), follows PRISMA guidelines (S1 PRISMA Checklist) [17]. We searched Medline (via PubMed), Web of Science, and IEEE. The last search was conducted on December 15, 2022, and we manually searched bibliographies, citations, and related articles of included studies. S1 Text lists each search term. Two independent reviewers (JJ and JD) assessed study eligibility, resolving disagreements through discussion or involving a third author (BL) if necessary.
Eligible studies predicted fracture outcomes using structured patient-level health data (electronic health records and cohort study data) and image-related data (MRI, DXA, and X-ray). We excluded reviews, gray literature, non-human subject studies, studies without machine learning or deep learning models, studies that did not report fracture outcomes, AUC, accuracy, sensitivity, specificity, or validation, and studies with insufficient algorithm development details. We only considered studies published in English, without time restrictions.

Data extraction
All three categories of data were considered: image-related, tabular, and both. Image-type studies used MRI, DXA, CT, or X-ray; tabular-type studies used structured electronic health records data; image and tabular studies used both data types. Two investigators (JJ and JD) independently evaluated study eligibility, extracting relevant data for articles meeting inclusion criteria. A structured data collection form was used to capture general study characteristics, population, data preprocessing, clinical outcomes, analytical methods, and results. A third author (BL) resolved discrepancies if necessary. We constructed the contingency table (true positive, true negative, false positive, and false negative) based on the provided information of sensitivity, specificity, positive predictive value, and negative predictive value for each study (S4 Table). If a study reported multiple sensitivity and specificity values, we used the highest sensitivity and specificity.
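As a minimal sketch of how a 2x2 contingency table can be back-calculated from reported accuracy metrics, the Python snippet below assumes that the numbers of fracture and non-fracture cases in the test set are known. The function names and the example figures are hypothetical and are not taken from any included study; the actual extraction may have relied on other combinations of reported values.

```python
def reconstruct_contingency(sensitivity, specificity, n_fracture, n_no_fracture):
    """Back-calculate TP, FN, TN, FP from sensitivity, specificity, and class sizes.

    Assumption: the reported metrics were computed on exactly these class sizes,
    so rounding to the nearest integer recovers the original counts.
    """
    tp = round(sensitivity * n_fracture)
    fn = n_fracture - tp
    tn = round(specificity * n_no_fracture)
    fp = n_no_fracture - tn
    return {"tp": tp, "fn": fn, "tn": tn, "fp": fp}


def predictive_values(table):
    """Return (PPV, NPV), useful as a cross-check against reported values."""
    ppv = table["tp"] / (table["tp"] + table["fp"])
    npv = table["tn"] / (table["tn"] + table["fn"])
    return ppv, npv


# Hypothetical study: 100 fracture cases, 200 non-fracture cases.
table = reconstruct_contingency(0.92, 0.90, n_fracture=100, n_no_fracture=200)
ppv, npv = predictive_values(table)
```

Comparing the recomputed PPV and NPV with the values a study reports is one way to confirm that the reconstructed counts are consistent.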

Statistical analysis
Meta-analyses were performed using a random-effects model to calculate the pooled sensitivity and specificity based on logit transformation [18,19], using the Clopper-Pearson interval to calculate 95% confidence intervals for each study [20]. We used a unified hierarchical summary receiver operating characteristic curve (HSROC) to investigate the relationship between logit-transformed sensitivity and specificity. We calculated the diagnostic odds ratio and used inverse variance weighting for pooling with random-effects models [21].
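The pooling step can be illustrated with a short Python sketch. It is not the authors' R code and is not claimed to reproduce the output of the 'meta' or 'mada' packages. It assumes a DerSimonian-Laird estimate of the between-study variance, a 0.5 continuity correction for zero cells, and a normal-approximation interval on the logit scale; all function names and study counts are hypothetical.

```python
import math


def logit_with_variance(events, total):
    """Return the logit of events/total and its approximate variance."""
    non_events = total - events
    # Assumption: add 0.5 to both cells when either is zero so the logit is finite.
    if events == 0 or non_events == 0:
        events, non_events = events + 0.5, non_events + 0.5
    return math.log(events / non_events), 1.0 / events + 1.0 / non_events


def random_effects_pool(effects, variances):
    """Inverse-variance random-effects pooling with a DerSimonian-Laird tau^2."""
    weights = [1.0 / v for v in variances]
    total_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / total_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = total_w - sum(w * w for w in weights) / total_w
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    return pooled, math.sqrt(1.0 / sum(re_weights)), tau2


def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pooled_proportion(events, totals, z=1.96):
    """Pool proportions on the logit scale and back-transform estimate and 95% CI."""
    pairs = [logit_with_variance(e, n) for e, n in zip(events, totals)]
    pooled, se, _ = random_effects_pool([p[0] for p in pairs], [p[1] for p in pairs])
    return inverse_logit(pooled), inverse_logit(pooled - z * se), inverse_logit(pooled + z * se)


def diagnostic_odds_ratio(tp, fp, fn, tn):
    """DOR = (TP * TN) / (FP * FN)."""
    return (tp * tn) / (fp * fn)


# Hypothetical true-positive counts and numbers of fracture cases for three studies.
sens, sens_low, sens_high = pooled_proportion([92, 45, 180], [100, 50, 200])
dor = diagnostic_odds_ratio(tp=92, fp=20, fn=8, tn=180)
```

Pooled specificity follows the same pattern, with true negatives as events and non-fracture cases as totals.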

Sensitivity analysis
The logit transformation does not consider the correlation between sensitivity, specificity, and threshold effects, so an alternative model is desirable to address this limitation. Barendregt et al. [22] recommend using the Freeman-Tukey double arcsine transformation instead of the logit transformation. Hence, we used the Freeman-Tukey double arcsine transformation as a sensitivity analysis [22] for a random-effects model.
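For illustration, the Freeman-Tukey double arcsine transformation of a single study's proportion and its back-transformation can be sketched in Python as follows. The sketch assumes the unscaled (sum) form of the transformation with variance 1/(n + 0.5) and Miller's (1978) inverse; software packages may use a rescaled variant, and the function names and counts are hypothetical.

```python
import math


def freeman_tukey(events, total):
    """Double arcsine transform of events/total and its approximate variance."""
    t = (math.asin(math.sqrt(events / (total + 1)))
         + math.asin(math.sqrt((events + 1) / (total + 1))))
    variance = 1.0 / (total + 0.5)
    return t, variance


def back_transform(t, total):
    """Miller's inverse of the double arcsine transform for sample size `total`."""
    s = math.sin(t)
    inner = 1.0 - (s + (s - 1.0 / s) / total) ** 2
    sign = 1.0 if math.cos(t) >= 0 else -1.0
    # max() guards against tiny negative values caused by floating-point rounding.
    return 0.5 * (1.0 - sign * math.sqrt(max(0.0, inner)))


# Hypothetical study: 92 detected fractures out of 100 fracture cases.
t, var = freeman_tukey(92, 100)
recovered = back_transform(t, 100)
```

When a pooled value on the transformed scale is back-transformed, a representative sample size has to be chosen (for example, the harmonic mean of the study sizes); that choice is an assumption of the analyst and is not specified in the text above.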

Subgroup analysis
Two subgroup analyses were conducted: 1) three data types (images, tabular, or images and tabular) and 2) different image modalities among image data used in AI. Statistical analysis was performed using R [23], with 'meta' [24] and 'mada' [25] packages. A p-value of < 0.05 was considered statistically significant.

Publication bias
We utilized the contour-enhanced funnel plot [26] to illustrate the assessment of publication bias for each fracture outcome and data type used. Each data point in the contour-enhanced funnel plot represents an individual study, and the plot incorporates contour lines that delineate expected areas of symmetry in the absence of bias. The plot provides insights into potential publication bias, with asymmetry suggesting a deviation from expected publication patterns. We further employed the trim-and-fill method to address publication bias [22]. This statistical approach helps adjust for potentially missing studies due to publication bias by imputing hypothetical "filled" studies and recalculating the effect size accordingly.
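In the common formulation of contour-enhanced funnel plots, the contours mark conventional levels of statistical significance, so each study can be assigned to a shaded region from its effect size and standard error. The Python sketch below does this for the log diagnostic odds ratio. The choice of the log DOR as the plotted effect, the thresholds (0.01, 0.05, 0.10), the function names, and the example counts are assumptions made for illustration.

```python
import math


def log_dor_with_se(tp, fp, fn, tn):
    """Log diagnostic odds ratio and its standard error from a 2x2 table."""
    log_dor = math.log((tp * tn) / (fp * fn))
    se = math.sqrt(1 / tp + 1 / fp + 1 / fn + 1 / tn)
    return log_dor, se


def two_sided_p(effect, se):
    """Two-sided p-value of a Wald z-test against a null effect of zero."""
    z = abs(effect) / se
    return math.erfc(z / math.sqrt(2))


def contour_region(effect, se):
    """Name the significance region a study falls into on the funnel plot."""
    p = two_sided_p(effect, se)
    if p < 0.01:
        return "p < 0.01"
    if p < 0.05:
        return "0.01 <= p < 0.05"
    if p < 0.10:
        return "0.05 <= p < 0.10"
    return "p >= 0.10"


# Two hypothetical studies: a large accurate one and a small inconclusive one.
large_study = contour_region(*log_dor_with_se(tp=92, fp=20, fn=8, tn=180))
small_study = contour_region(*log_dor_with_se(tp=6, fp=4, fn=4, tn=6))
```

If the studies that appear to be missing would fall mostly in the non-significant region, the asymmetry is more plausibly explained by publication bias than by other causes.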

Risk of bias and applicability
Two reviewers (JJ and JD) independently evaluated the risk of bias in each study using Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) [27], assessing four domains: patient selection, index test, reference standard, and flow and timing. The risk of applicability was evaluated with the first three domains.

Study selection and characteristics
Our search identified 1,128 studies, yielding 717 unique ones after removing duplicates (Fig 1). We screened titles and abstracts and selected 496 studies for full-text review based on our inclusion criteria. We then excluded 254 studies for lacking sensitivity and specificity information (149 studies), not having fracture-related outcomes (75 studies), not using ML models (28 studies), or being survey or review articles (2 studies). We further removed 176 studies because no contingency table could be calculated from the provided information. Ultimately, 66 studies were included in our systematic review and meta-analysis.
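The selection counts reported in this paragraph can be checked with a few lines of arithmetic. The Python snippet below only restates the numbers given in the text; the variable names are illustrative.

```python
identified = 1128          # records returned by the database searches
unique_records = 717       # records remaining after duplicate removal
full_text_reviewed = 496   # records selected for full-text review

# Reasons for exclusion at full-text review, as reported in the text.
exclusions = {
    "no sensitivity/specificity information": 149,
    "no fracture-related outcome": 75,
    "no ML model": 28,
    "survey or review article": 2,
}
excluded_full_text = sum(exclusions.values())
no_contingency_table = 176

duplicates_removed = identified - unique_records
included = full_text_reviewed - excluded_full_text - no_contingency_table
```

The exclusion reasons sum to the 254 studies reported, and the remaining count matches the 66 studies included in the meta-analysis.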

AI algorithms summary
Among the 54 studies that utilized imaging-related data, convolutional neural networks (CNN), a deep learning approach, emerged as the predominant choice, followed by instances where transfer learning was adopted. In some cases, the limited availability of labeled image data prompted the utilization of transfer learning [53,69], and certain studies incorporated CNNs pre-trained on non-fracture-related radiological images [6,28,85]. Within the subset of nine studies involving tabular data, fully connected artificial neural networks were the prevailing choice. Logistic regression and ensemble learning models, including Random Forest, Gradient Boosting, and XGBoost, were also commonly employed. Among the three studies that harnessed both image and tabular data, a notable trend was the adoption of support vector machines with various kernels [57,68].

Hyperparameter optimization
Thirty-six studies reported the detailed process for optimizing hyperparameters in the final selected models (S3 Table). Beyaz et al. utilized genetic algorithms to identify the optimal hyperparameters for their CNN architecture [67]. Liu et al. explored the impact of varying the number of hidden neurons in the output layer [32]. Nissinen et al. [72] employed two approaches for hyperparameter searches: random search [87] and hyperband [88].

Data split and validation in an external data set
Fifty-one studies reported the split sample for model development (training) and validation (testing) (S3

Publication bias
The assessment of publication bias encompassed each fracture outcome and the utilization of distinct data types (S5 and S6 Tables). The asymmetry observed in the funnel plots implies the presence of possible publication bias, particularly pronounced in studies with smaller sample sizes. However, the trim-and-fill method corrected this asymmetry, rendering the distribution symmetrical (S2 Fig and S3 Fig). After using the trim-and-fill method to adjust for publication bias, the diagnostic odds ratio (DOR) remained statistically significant (S5 and S6 Tables).
Table 4 notes. Data in parentheses are 95% confidence intervals. 1) The logit transformation was used to calculate the pooled sensitivity and specificity. 2) The arcsine transformation was used to calculate the pooled sensitivity and specificity. UGWSI: Ultrasonic Guided Wave Spectrum Image, VFAI: Vertebral Fracture Assessment Image. https://doi.org/10.1371/journal.pdig.0000438.t004

Risk of bias and applicability
The assessment of bias and applicability for 66 studies revealed moderate to low concerns (Table 5 and Fig 4). Patient selection and reference standards were the primary concerns for bias and applicability. Many studies did not report sample characteristics such as gender and age, limiting generalizability. Some studies did not report patient selection or reference standard computation methods [62,75,78]. Threshold adjustments in some studies might have led to overfitting, reducing the generalizability of the models [72]. Most studies exhibited applicability concerns and were not easily generalizable to other populations. For example, one study [66] focused on patients visiting the emergency department for acute proximal femoral fracture, limiting generalizability to the general population. Another study included patients with existing vertebral fractures, reducing generalizability to the general population. Data preprocessing often involved the removal of occult fractures, with some studies excluding radiographic occult fractures requiring additional modalities for confirmation [53].
Other studies excluded images with uncertain, traumatic, or pathological fractures or those with insufficient quality or resolution [58]. A few studies did not provide specific locations for fracture types or specify which ones were included [12,70].

Discussion
Our systematic review and meta-analysis offer the most current and comprehensive evaluation of the diagnostic accuracy of Artificial Intelligence (AI) for predicting various osteoporotic fracture outcomes using various imaging modalities and data types. This study represents the first systematic review and quantitative meta-analysis of AI's diagnostic accuracy and comparison using different data types across multiple fracture outcomes. Our analysis reveals four major findings. First, AI provides high classification accuracy for fracture detection when utilizing imaging data, with a pooled sensitivity of 92% (95% CI: 90, 94). Convolutional neural networks with transfer learning exhibit significantly high accuracy when using image data in classifying fractures. Second, our study comprehensively reviews diagnostic accuracy among   96). Third, our sensitivity analysis employing the arcsine transformation, which complemented the primary analysis utilizing the logit transformation, supports the robustness of our findings. Both methodologies yielded similar results regarding pooled sensitivity and specificity, which underscores the reliability and consistency of our findings. Fourth, significant flaws were observed in the study design and reporting of AI for real-world applicability.
For example, only a few studies described the patient characteristics of data, and only half (n = 33) reported the hyperparameter selection process. Our findings align with other systematic reviews and meta-analyses [15,16], showing that AI demonstrates considerably high pooled sensitivity and specificity. However, inconsistent results have been observed when comparing different image modalities in fracture detection. External validation enables a more robust demonstration of clinical utility than simple internal train/test cross-validation. Our study shows that only thirteen studies (20%) out of sixty-six performed external validation. The limitation of validating in an external dataset is the lack of availability of large, labeled datasets due to resistance to sharing data across institutions because of patient privacy issues and the necessity of experts for labeling the datasets. Although external validation enhances the robustness of AI systems, it could potentially attenuate their impact on the system. Consequently, it is crucial to acknowledge that external validation might not always be advisable due to the potential impact of factors like sample size and the diversity of the training set. Two systematic reviews [89,90] provide valuable insights into the current limitations of AI studies. A broad discussion of possible solutions is necessary because methodological challenges, risk of bias, and applicability concerns can arise in AI during all stages of development, including data curation, model selection, implementation, and validation. Both reviews recommend that researchers follow standardized reporting guidelines to determine the risk of bias and improve methodological quality assessment. Our study has limitations; the major one is that only a few studies that employed tabular data or combined tabular and image data were eligible. Second, we excluded non-English-language articles, which may have overlooked some studies published in a different language. Third, many of the included studies had study design flaws. They were classified as having great concern for bias and applicability, limiting the conclusions that could be drawn from the meta-analysis because studies with a high risk of bias and applicability concerns overestimated algorithm performance.
This systematic review and meta-analysis have important implications for clinical practice. Given the high diagnostic performance of AI, these techniques could be integrated into existing fracture risk assessment tools to enhance the identification of patients at risk and facilitate early intervention. Healthcare professionals should be trained in interpreting and applying these methods in clinical practice.
This study observed superior prediction performance with single radiograph input data over multimodal imaging, which can be attributed to the radiographs' consistent and standardized anatomical view, reducing noise and variability inherent in multimodal inputs [91]. Radiographs precisely capture fracture-relevant features, while added modalities like CT and MRI can diversify and possibly weaken these key features [92]. Multimodal inputs can also elevate overfitting risks, particularly with limited datasets [93]. Radiographs, being more accessible and cost-effective than CT or MRI, allow for larger, representative datasets, enhancing model performance. The decision between single radiographs and multimodal inputs should be rooted in the research context, data availability, and prediction objectives. Despite the evident advantages of radiographs, specific scenarios may warrant multimodal integration for improved predictions. We also observed that relying solely on image data produced better AUC values than combining it with tabular data. Image data's richness and direct relevance to fracture detection offer clear diagnostic advantages [94]. Convolutional neural networks (CNNs), identified in our study, are adept at processing this data, emphasizing subtle fracture-related visual nuances [95]. In contrast, tabular data could introduce noise and inconsistencies. Using image data alone ensures focus on vital visual features and offers a more standardized data format than diverse tabular inputs.
Further research is needed to address the limitations identified in the included studies and to explore the performance of specific ML and DL algorithms. Researchers should provide more detailed information about their study populations and methods, including patient selection, fracture type location, and the reference standard used. Future studies should also investigate the impact of factors such as training dataset size, model architecture, and the inclusion of clinical and demographic variables on the diagnostic performance of AI. Future research will help develop more accurate and generalizable models for predicting osteoporotic fractures and inform evidence-based clinical practice. Several novel diagnostic meta-analysis methodologies have recently been introduced [96-98]. Nevertheless, due to the limited sample sizes within selected studies focusing on fractures beyond vertebral and hip injuries and studies involving tabular data or combined tabular and image data, incorporating these methodologies into our present study was unfeasible. While we acknowledge their potential applicability, the current study's unique characteristics led us to refrain from their implementation. We will implement these methodologies in our forthcoming investigations, particularly as more comprehensive studies become available. To aid future researchers, we provide an array of crucial challenges and their potential resolutions pertinent to applying machine learning or deep learning for fracture diagnosis (S7 Table).
In conclusion, our meta-analysis highlights the high diagnostic accuracy of AI in various fracture outcomes. As AI demonstrates reliable results in fracture detection, it holds the potential to streamline fracture diagnosis in healthcare systems. However, transparent reporting of study methods and designs for AI development and validation is essential to ensure their real-world applicability. By addressing the current research landscape's limitations and promoting standardized guidelines, we can facilitate the integration of AI technologies into clinical practice and enhance the prediction of osteoporotic fractures, ultimately leading to improved patient care.

Fig 1. Flow chart of the literature selection in PubMed, Web of Science, and Institute of Electrical and Electronics Engineers (search conducted on December 15, 2022). *IEEE: Institute of Electrical and Electronics Engineers. https://doi.org/10.1371/journal.pdig.0000438.g001

Fig 4. Summary of the Quality Assessment of Diagnostic Accuracy Studies for the risk of bias and applicability in the included 66 studies. The risk of bias was measured in four domains: patient selection, index test, reference standard, and flow and timing. The risk of applicability was evaluated with three domains: patient selection, index test, and reference standard. https://doi.org/10.1371/journal.pdig.0000438.g004

Table 2. Pooled sensitivities, specificities, and diagnostic odds ratios for 60 studies across different fracture outcomes.
Studies with only one selected fracture outcome (cervical spine, hand, lumbar spine, proximal humerus, supracondylar, and trabecular bone) were omitted.

Table 4. Pooled sensitivities, specificities, and diagnostic odds ratios for 54 studies (including three from the tabular and image data used) in different image modalities.
Studies with only one selected image modality (Radiograph + CT + MRI, Radiograph + MRI, UGWSI) were omitted.